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CN 1 Abstract 

We present a randomized iterative algorithm that exponentially converges in expectation to the mini- 
mum Euclidean norm least squares solution of a given linear system of equations. The expected number 



of arithmetic operations required to obtain an estimate of given accuracy is proportional to the square 
condition number of the system multiplied by the number of non-zeros entries of the input matrix. The 
proposed algorithm is an extension of the randomized Kaczmarz method that was analyzed by Strohmer 

■ and Vershynin. 

^ ■ 1 Introduction 

The Kaczmarz method is an iterative projection algorithm for solving linear systems of equations HKac37| . 
Due to its simplicity, the Kaczmarz method has found numerous applications including image recon- 
struction, distributed computation and signal processing to name a few IFCM+921 IrfegQl INatOll IFZ121 . 
see [Cen81[ for more applications. The Kaczmarz method has also been rediscovered in the field of im- 
age reconstruction and called ART (Algebraic Reconstruction Technique) BGBH70I , see also |CZ971lHer80l 
for additional references. It has been also applied to more general settings, see ICen811 Table 1] and 
BTom55llMcC75l for non-linear versions of the Kaczmarz method. 
V/~j , Let A G Rmx-n anc j k g j^m Throughout the paper all vectors are assumed to be column vectors. The 

■ Kaczmarz method operates as follows: Initially, it starts with an arbitrary vector x' ' 6 R™. In each iteration, 
\ the Kaczmarz method goes through the rows of A in a cyclic manner and for each selected row, say i-th 

£Nj ■ row A^', it orthogonally projects the current estimate vector onto the affine hyperplane defined by the i-th 

constraint of Ax = b, i.e., {x | (Aw, x) = bi} where (•, •) is the Euclidean inner product. More precisely, 
assuming that the ifc-th row has been selected at fc-th iteration, then the (k + l)-th estimate vector x' fc+1 - ) is 
inductively defined by 

rS ■ h. _ /A{ik) x (fc)\ 

b ' x( fc+1 > := x« + X k {A '* } - A M 

\\ A{ ' k) \\l 

where Xk € R are the so-called relaxation parameters and |j-|| 2 denotes the Euclidean norm. The original 
Kaczmarz method corresponds to Xk = 1 for all k > and all other setting of Afc's are usually referred as 
the relaxed Kaczmarz method in the literature IICen81ilGal03| . 

Kaczmarz proved that this process converges to the unique solution for square non-singular matri- 
ces IIKac37l , but without any attempt to bound the rate of convergence. Bounds on the rate of convergence of 
the Kaczmarz method are given in IMcC75L HAns84l and IGal03l Theorem 4.4, p.120]. In addition, an error 
analysis of the Kaczmarz method under the finite precision model of computation is given in | Kni93 , K ni96ll . 



*Anastasios Zouzias is with the Department of Computer Science at the University of Toronto, Canada. E-mail: 
zouziasScs . toronto . edu. Part of this work was done while the author was visiting the Department of Computer Science at 
Princeton University. 

t Nikolaos M. Frerisis with IBM Research - Zurich, Saumerstrasse 4, 8803 Rtischlikon, Switzerland. E-mail: nif@zurich. ibm. com 
1 That is, selecting the indices of the rows from the sequence 1, 2, . . . , m, 1, 2, 



1 



Nevertheless, the Kaczmarz method converges even if the linear system Ax = b is overdetermined 
(to > n) and has no solution. In this case and provided that A has full column rank, the Kaczmarz method 
converges to the least squares estimate. This was first observed by Whitney and Meany IWM67I who 
proved that the relaxed Kaczmarz method converges provided that the relaxation parameters are within 
[0, 2] and X k ->■ 0, see also IICEG831 Theorem 1], |Tan71| and IHN90I for additional references. 

In the literature there was empirical evidence that selecting the rows non- uniformly at random may be 
more effective than selecting the rows via Kaczmarz's cyclic manner IHM93llFCM+92l . Towards explain- 
ing such an empirical evidence, Strohmer and Vershynin proposed a simple randomized variant of the 
Kaczmarz method that has exponential convergence in expectation fS~V09| assuming that the linear system 
is solvable; see also IILL10I for extensions to linear constraints. A randomized iterative algorithm that com- 
putes a sequence of random vectors x( ) , x' 1 ) , . . . is said to converge in expectation to a vector x* if and only 

if E ||x( fe ) — x*||^ — s> as k — > oo, where the expectation is taken over the random choices of the algorithm. 
Soon after ISV09I , Needell analyzed the behavior of the randomized Kaczmarz method for the case of full 
column rank linear systems that do not have any solution [NeelO]. Namely, Needell proved that the ran- 
domized Kaczmarz estimate vector is (in the limit) within a fixed distance from the least squares solution 
and also that this distance is proportional to the distance of b from the column space of A. In other words, 
Needell proved that the randomized Kaczmarz method is effective for least squares problems whose least 
squares error is negligible. 

In this paper we present a randomized iterative least squares solver (Algorithm [3) that converges in 
expectation to the minimum Euclidean norm solution of 

min II Ax- b|L . (1) 

The proposed algorithm is based on ISV09I INeelOl and inspired by ]Pop99| . More precisely the proposed 
algorithm can be thought of as a randomized variant of Popa's extended Kaczmarz method ]Pop99| , there- 
fore we named it as randomized extended Kaczmarz. 

Organization of the paper In Section [2J we briefly discuss related work on the design of deterministic 
and randomized algorithms for solving least squares problems. In Section |3j we present a randomized 
iterative algorithm for projecting a vector onto a subspace (represented as the column space of a given 
matrix) which may be of independent interest. In addition, we discuss the convergence properties of the 
randomized Kaczmarz algorithm for solvable systems (Section [32) and recall its analysis for non-solvable 
systems (Section [33). In SectionH) we present and analyze the randomized extended Kaczmarz algorithm. 
Finally, in Section|5]we provide a numerical evaluation of the proposed algorithm. 



2 Least squares solvers 

In this section we give a brief discussion on least squares solvers including deterministic direct and iterative 
algorithms together with recently proposed randomized algorithms. For a detailed discussion on determin- 
istic methods, the reader is referred to | Bj96j . In addition, we place our contribution in context with prior 
work. 



Deterministic algorithms In the literature, several methods have been proposed for solving least squares 
problems of the form Q). Here we briefly describe a representative sample of such methods including 
the use of QR factorization with pivoting, the use of the singular value decomposition (SVD) and iterative 
methods such as Krylov subspace methods applied on the normal equations [Saa03|. LAPACK provides ro- 
bust implementations of the first two methods; DGELSY uses QR factorization with pivoting and DGELSD 
uses the singular value decomposition ||ABD + 90| . For the iterative methods, LSQR is equivalent to apply- 
ing the conjugate gradient method on the normal equations [PS82] and it is a robust and numerically stable 
method. 
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Randomized algorithms To the best of our knowledge, most randomized algorithms proposed in the 
theoretical computer science literature for approximately solving least squares are mainly based on the fol- 
lowing generic two step procedure: first randomly (and efficiently) project the linear system into sufficiently 
many dimensions, and second return the solution of the down-sampled linear system as an approximation 
to the original optimal solution HDMM061 ISar061 IC W091 INDT091 QyfeTTl IDMMS1 1 II , see also ICW12I . Con- 
centration of measure arguments imply that the optimal solution of the down-sampled system is close to 
the optimal solution of the original system. The accuracy of the approximate solution using this approach 
depends on the sample size and to achieve relative accuracy e, the sample size should depend inverse 
polynomially on e. This makes these approaches unsuitable for the high-precision regime of error that is 
considered here. 

A different approach is the so called randomized preconditioning method, see IIRT081 lAMTlOI . The 
authors of [AMT10] implemented Blendenpik, a high-precision least squares solver. Blendenpik consists of 
two steps. In the first step, the input matrix is randomly projected and an effective preconditioning matrix 
is extracted from the projected matrix. In the second step, an iterative least squares solver such as the LSQR 
algorithm of Paige and Saunders [PS82) is applied on the preconditioned system. Blendenpik is effective 
for overdetermined and underdetermined problems. 

A parallel iterative least squares solver based on normal random projections called LSRN was recently 
implemented by Meng, Saunders and Mahoney IM SM11I . LSRN consists of two phases. In the first pre- 
conditioning phase, the original system is projected using random normal projection from which a pre- 
conditioner is extracted. In the second step, an iterative method such as LSQR or the Chebyshev semi- 
iterative method |GV61| is applied on the preconditioned system. This approach is also effective for over- 
determined and under-determined least squares problems assuming the existence of a parallel computa- 
tional environment. 

2.1 Relation with our contribution 

In Section|5l we compare the randomized extended Kaczmarz algorithm against DGELSY, DGELSD, Blenden- 
pik. LSRN BMSMIH did not perform well under a setup in which no parallelization is allowed, so we do 
not include LSRN's performance. The numerical evaluation of Section [5] indicates that the randomized ex- 
tended Kaczmarz is effective on the case of sparse, well-conditioned and strongly rectangular (both overde- 
termined and underdetermined) least squares problems, see Figure [U Moreover, the randomized extended 
Kaczmarz algorithm has also comparable performance with LAPACK's routine for the dense random in- 
put matrices, see Figure |2] (notice that the proposed algorithm almost matches Blendenpik's performance 
for the underdetermined case, see Figure |2(b")j l. On the other hand, a preconditioned version of the proposed 
algorithm does not perform well under the case of ill-conditioned matrices, see Figure |3] 



3 Background 

Preliminaries and Notation For an integer m, let [m] := {1, . . . , m}. Throughout the paper all vectors are 
assumed to be column vectors. We denote the rows and columns of A by A' 1 ' , ... , A^ m ' and Am, ... , A( n ), 
respectively (both viewed as column vectors). TZ(A) denotes the column space of A, i.e., TZ(A) := {Ax | x e 
R n } and TZ(A)- 1 denotes the orthogonal complement of 7Z(A). Given any b G R' m , we can uniquely write it 

as b K ( A ) + h n{A) ±, where b TC(A) is the projection of b onto TZ{A). ||A|| F := \JYa=i YTj=i l a ul 2 and ll A ll2 : = 
max x ^o ||Ax|| 2 / ||x|| 2 denotes the Frobenius norm and spectral norm, respectively. Let o\ > <T2 > • ■ ■ > 
°rank(A) De the non-zero singular values of A. We will usually refer to <r\ and <7 ran k(A) as ^max and er m i n , 
respectively. The Moore-Pensore pseudo-inverse of A is denoted by At BGL96I . Recall that 1 1 A^ 1 1 = 1 / Cmm. 
For any non-zero real matrix A, we define 

KF 2 (A):=!|A||^||At|| 2 . (2) 
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Related to this is the scaled square condition number introduced by Demmel in [Dem88J, see also ISV09II . 
It is easy to check that the above parameter k 2 (A) is related with the condition number of A, k 2 (A) := 
^max/^min' v ^ a me inequalities: k 2 (A) < k 2 (A) < rank (A) ■ n 2 (A). We denote by nnz (■) the number of non- 
zero entries of its argument matrix. We define the average row sparsity and average column sparsity of A by 
R avg and C avg/ respectively, as follows: 

m n 

Ravg := ^2 * nnz ( a(i) ) an d C avg := y^Pjnnz (A (i )) 

i=l 3 =l 
II l|2 2 1 1 ( '\ 1 1 ^ 2 

where p.,- := || Ay) || 2 / ||A|| F for every i £ [n] and qi := ||A W || 2 / ||A|| F for every i £ [m]. The following fact 
will be used extensively in the paper. 

Fact 1. Let A be any non-zero real m x n matrix and b £ R Tn . Denote by x LS := A^b. T/ze« x IS = A^b-^A). 

We frequently use the inequality 1 — t < cxp(-t) for every t < 1. We conclude this section by collecting a 
few basic facts from probability theory that will be frequently used. For any random variable X, we denote 
its expectation by E[X] or EX. If X is a non-negative random variable, Markov's inequality states that 
P (X > t) < t- 1 E[X]. Let X and Y be two random variables, then E[X + Y} = ELY] + E[Y}. We will refer 
to this fact as linearity of expectation. Let E\, £2, • ■ ■ , £1 be a set of events defined over some probability space 
holding with probabilities Pi,P2,-.-Pi respectively, then P {£\ U £2 U . . . U E\) < Y^\=iPi- We refer to this 
fact as union bound. 



3.1 Randomized Approximate Orthogonal Projection 



Algorithm 1 Randomized Orthogonal Projection 

l: procedure (A, b, T) > A € R mxn ,b e R m , T e N 

2: Initialize z (0) = b 

3: forfc = 0,l,2,...,T-l do 

4: Pick ] k e [n] with probability^- := ||A (i ) || 2 / ||A||p , j € [n] 

5: Setz( fe + 1 ) = (l m -^kA^) 

6: end for 

7: Output 

8: end procedure 



In this section we present a randomized iterative algorithm (AlgorithmQ} that, given any vector b 6 R m 
and a linear subspace of R m represented as the column space of a given matrix A, approximately computes 
the orthogonal projection of b onto the column space of A (denoted by b K ( A ), b TC( - A ) = AA^b), see HCRT11H 
for a different approach. 

Algorithm [I] is iterative. Initially, it starts with z' ) = b. At the k-th iteration, the algorithm randomly 
selects a column Ay) of A for some j, and updates z' fe ' by projecting it onto the orthogonal complement 
of the space of A^) . The claim is that randomly selecting the columns of A with probability proportional 
to their square norms implies that the algorithm converges to b K ( A )i in expectation. After T iterations, 
the algorithm outputs z^ T ) and by orthogonality b — zy> serves as an approximation for b^(A). The next 
theorem bounds the expected rate of convergence for Algorithm [T] 

Theorem 2. Let A £ R mx ", b £ R" 1 and T > 1 be the input to Algorithm^ Fix any integer k > 0. In exact 
arithmetic, after k iterations of Algorithm^it holds that 



E 



r(*0 



-b 



•R,(A) J 



< 1- 



1 



v 2 (A) 



>n{A) 
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Moreover, each iteration of Algorithm\T\requires in expectation (over the random choices of the algorithm) at most 
5C avg arithmetic operations. 



Remark 1. A suggestion for a stopping criterion for Algorithm[j]is to regularly check: 



< e for some given 



accuracy e > 0. It is easy to see that whenever this criterion is satisfied, it holds that ||b K(A) _L - z«|| 2 /||z«|| 2 < 
en F (A), i.e., b - z^ w b TC(A) . 

We devote the rest of this subsection to prove Theorem^] Define P(j) := I m - " u) for every j £ [n). 

II a (j)|| 2 

Observe that P(j)P(j) = P(j), i.e., P(j) is a projector matrix. Let X be a random variable over {1,2,..., n} 



that picks index j with probability Ayj || 
make use of the following fact. 



/||A||p. It is clear that E[P(X)] 



AA T / ||A||p. Later we will 



AA 1 



< 1 



Fact 3. For every vector u in the column space of A, it holds 

Define := z^ k > — b n ^± for every k > 0. A direct calculation implies that 

e^ = P(j k )e^. 



IIAII 



Indeed, e (fe > = z< fe ) - b 



K(A) J 



P{j k )^ k -V-h- 



K(A) J 



P(jfc)(e 



(fe-i). 



the definitions of e( fc ), z (fe) , e (fc x) and the fact that P(jfe)b 



K(A) J 



} K(A) 
J K(A)- 



0-b 



P0fc)e 



(fc-i) 



using 



for any j k £ [n] . Moreover, it is 



easy to see that for every k > e( fc ) is in the column space of A, since e(°) = b b n ^± = £ TZ(A), 

e (fc) _ P(j fc )e^ fe_1 ) and in addition P(jk) is a projector matrix for every j k £ [n]. 

Let X\ , X 2 , ... be a sequence of independent and identically distributed random variables distributed as 
X. For ease of notation, we denote by E^_i[-] = Ex k [■ I Xi,X 2 , ■ . . , X k _i], i.e., the conditional expectation 
conditioned on the first (k — 1) iteration of the algorithm. It follows that 



E 



fc— l 



< 



(fc-i) 



P{X k )e 



(fe-i) 



P(X k )P(X k )e^ 
AA T 



Efc-^P^)^*- 1 ), P(X fe )e( fc - x 

e^" 1 ), E k ^[P(X k )]e^ 

2 



I, 



< 



3 (fc-l) 



where we used linearity of expectation, the fact that P( ) is a projector matrix, Cauchy-Schwarz inequality 
and Fact[3] Repeating the same argument k — 1 times we get that 





2 <fl- 1 V 


e (o) 




2"V ^ (A)y ) 





Note that e<°) = b - b n 



(A)J 



J K(A) 



to conclude. 



Step 5 can be rewritten as z 



(fc+i) = z (fe) _ 



Hi*) 



At every iteration, the inner 



;<A( Jfc ), zW)/||A 0fc) | 

product and the update from z^ to z' fc+1 ^ require at most 5nnz (Ay fe )) operations for some jfc € [n]; hence 
in expectation each iteration requires at most Y^j=i 5pjnnz (A^) = 5C avg operations. 



3.2 Randomized Kaczmarz 

Strohmer and Vershynin proposed the following randomized variant of Kaczmarz algorithm (Algorithm^, 
see [SV09J for more details. The following theorem is a restatement of the main result of [SV09] without 
imposing the full column rank assumption. 



5 



Algorithm 2 Randomized Kaczmarz ISV09I 



l: procedure (A, b, T) > A e R rnxn , b G R 

2: Set x'- -' to be any vector in the row space of A 
3: for k = 0,1,2,...,T- 1 do 

II /■ "\ 1 1 ^ 2 

4: Pick ifc G [m] with probability qi := ||A W || 2 /||A|| F ,« G 

5- Set x< fe+1 ) - x« + ^-( x(fc) - A( '"') A fa) 

b. setx -x + || A(lfc) ||- 

6: end for 

7: Output X^ T ) 

8: end procedure 



Theorem 4. Let A G R mx ™, b G R m and T > 1 be the input to Algorithm^ Assume that Ax = b has a solution 
and denote x LS := A^b. In exact arithmetic, Algorithm\2\converges to x IS in expectation: 



E 



xW-x„ 



< 1- 



s?(A) 



AO) 



Vfc > 0. 



(3) 



Remark 2. T/ze akwe theorem has been proved in jSV09^ for the case of full column rank. Also, the rate of expected 



convergence in j|Sy09f is 1 — 1/k 2 (A) where k 2 (A) 
k 2 (A) is infinite whereas k 2 (A) is bounded. 



IIAII^Ax 



min (m,n) 



(A 1 A). Notice that if rank (A) < n, then 



We devote the rest of this subsection to prove Theorem 0] following [SV09|. The proof is based on the 
following two elementary lemmas which both appeared in [SV09]. However, in our setting, the second 
lemma is not identical to that in |SV09[. We deferred their proofs to the Appendix. 

Lemma 5 (Orthogonality). Assume that Ax = b has a solution and use the notation of Algorithm^ then x' fc+1 ) — 
x LS is perpendicular to x^ fc+1 ^ — x (fe ) for any k > 0. In particular, in exact arithmetic it holds that ||x( fc+1 ) — x LS || = 



r(*0 



r (fe+i) _ „(fc)| 



12' 



The above lemma provides a formula for the error at each iteration. Ideally, we seek to minimize the 
error at each iteration which is equivalent to maximizing 1 1 x( fc+1 ) — x' fc ^ 1 1 over the choice of the row projec- 
tions of the algorithm. The next lemma suggests that by randomly picking the rows of A reduces the error 
in expectation. 

Lemma 6 (Expected Error Reduction). Assume that Ax = b has a solution. Let Z be a random variable over 
[m] with distribution P (Z = i) = ^ ,J 2 and assume that is a vector in the row space of A. If x^ 1 ) := 
x( fc ) + bz f x — ^ — -A^ z ' (in exact arithmetic), then 

H A(Z) |l2 



x (*+l) _ x 



< 1- 



«?(A) 



xW-x LS 



(4) 



Theorem S] follows by iterating Lemma|6j we get that 



x( fe+1 > -x,, 



< 



1 - 



Kp 2 (A) 



x<°)-x LS 



3.3 Randomized Kaczmarz Applied to Noisy Linear Systems 

The analysis of Strohmer and Vershynin is based on the restrictive assumption that the linear system has 
a solution. Needell made a step further and analyzed the more general setting in which the linear system 
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does not have any solution and A has full column rank INeelOI . In this setting, it turns out that the ran- 
domized Kaczmarz algorithm computes an estimate vector that is within a fixed distance from the solution; 
the distance is proportional to the norm of the "noise vector" multiplied by k%(A) IINeelOI . The following 
theorem is a restatement of the main result in [NeelO| with two modifications: the full column rank as- 
sumption on the input matrix is dropped and the additive term 7 of Theorem 2.1 in [NeelO| is improved to 
Il w ll2 / II A Hp- The only technical difference here from UNeelOl is that the full column rank assumption is not 
necessary, so we defer the proof to the Appendix for completeness. 

Theorem 7. Assume that the system Ax = y has a solution for some y e R m . Denote by x* := A^y- Let 
denote the k-th iterate of the randomized Kaczmarz algorithm applied to the linear system Ax = b with b := y + w 
for any fixed w G R m , i.e., run Algorithm\2\with input (A, b). In exact arithmetic, it follows that 



E 



2 < f 1 



(5) 



In particular, 



x« 



< 1- 



K?(A) 



x<°> - x" 



4 Randomized Extended Kaczmarz 

Given any least squares problem, Theorem [7] with w = h n ^ A -j± tells us that the randomized Kaczmarz 
algorithm works well for least square problems whose least squares error is very close to zero, i.e., j | w || 2 m 0. 
Roughly speaking, in this case the randomized Kaczmarz algorithm approaches the minimum ^-norm least 
squares solution up to an additive error that depends on the distance between b and the column space of 
A. 

In the present paper, the main observation is that it is possible to efficiently reduce the norm of the 
"noisy" part of b, b K ( A )^ (using Algorithm []} and then apply the randomized Kaczmarz algorithm on 
a new linear system whose right hand side vector is now arbitrarily close to the column space of A, i.e., 
Ax « b-ft(A). This idea together with the observation that the least squares solution of the latter linear 
system is equal (in the limit) to the least squares solution of the original system (see Fact [I) implies a 
randomized algorithm for solving least squares. 

Next we present the randomized extended Kaczmarz algorithm which is a specific combination of the 
randomized orthogonal projection algorithm together with the randomized Kaczmarz algorithm. 



4.1 The algorithm 



We describe a randomized algorithm that converges in expectation to the minimum ^2-norm solution vector 
x LS (Algorithm |3|. The proposed algorithm consists of two components. The first component consisting of 
Steps 5 and 6 is responsible to implicitly maintain an approximation to b TC ( A ) formed by b — z^ k \ The 
second component, consisting of Steps 4 and 7, applies the randomized Kaczmarz algorithm with input 
A and the current approximation b — z^ of h-jz(A)/ i-e., applies the randomized Kaczmarz on the system 
Ax = b z( k \ Since b — iS- k > converges to b^A), x( fc ) will eventually converge to the minimum Euclidean 
norm solution of Ax = b^A) which equals to x LS = A^b (see FactQ}. 

The stopping criterion of Step |8] was decided based on the following analysis. Assume that the termi- 



nation criteria are met for some k > 0. Let 
definition of z^). Then, 



J TC(A) J 



w for some w G 7£(A) (which holds by the 



A T zW 



= A 



( b 7Z(A) 



= ||A T w|| > CT min (A) 



z^-b 



K(A) J 
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Algorithm 3 Randomized Extended Kaczmarz (REK) 



l: procedure (A, b, e) > A e R mxn , b £ R m , e > 

2: Initialize x (0) = and z (0) = b 
3: for k = 0,1,2,... do 

II ( '\ 1 1 ^ 2 

4: Pick ifc € [m] with probability ^ := || || 2 /||A|| F ,i G [777,] 

11 1 1 2 2 

5: Pick jfe £ [n] with probability pj := || Aq--, || 2 / j|A|| F , j £ [n] 

6- Set z< fe+1 ) - z< fc ) - ( a ^' zW ) a, ■ n 

7. Setx (fc+1 — x fc 4- — ^ v LA( Jfc 

/. setx -x + || A ( Ifc )||^ 

8: Check every 8 min(m, n) iterations and terminate if it holds: 



|AxW - (b-z( fc ))|L l|A T z( fe )|L 

— < e and ' ,, , „ < e. 



|A|| F ||xW|| 2 ^ ||A||2||x(*)| 



end for 

Output x^ 
end procedure 



By re-arranging terms and using the second part of the termination criterion, it follows that ||z( fc ) — b7^(A)-L || 2 — 

£ JjAJH || x (fe)|| NoW/ 



A(x 



(fc) 



< 



AxW - (b-z<*)) 



<e||A| 



<e||A| 



,(fc) 



A k ) 



I A 11 2 



b - z^ - Ax„ 



.(*) 



Ak) 



where we used the triangle inequality, the first part of the termination rule together with b^A) = Ax LS and 
the above discussion. Now, since x 1 ^, x LS £ 1Z(A T ), it follows that 



Ak) 



c( fc ) 



<£K F (A)(1 + K F (A)). 



(6) 



Equation (O demonstrates that the forward error of REK after termination is bounded. 



4.2 Rate of convergence 

The following theorem bounds the expected rate of convergence of Algorithm[3] 

Theorem 8. After T > 1 iterations, in exact arithmetic, Algorithm\3\with input A (possibly rank-deficient) and b 
computes a vector x^ T ' such that 



A T ) 



< 1 



«*(A) 



LT/2J 



[1 + 2k 2 (A))||x ts || 



2 

2 ■ 



Proof. For the sake of notation, set a = 1 - l/zc^ (A) and denote by E fe [-] := E[- | i , j , ii, j\, . . . , ik,jk], i-e., 
the conditional expectation with respect to the first k iterations of Algorithm [3] Observe that Steps 5 and 6 
are independent from Steps 4 and 7 of Algorithm [3) so Theorem |2] implies that for every I > 



!<0 - b 



7J(A) J 



Ml II II II 

||bK(A) || 2 < ||b TC ( A ) || 2 . 



(7) 
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Fix a parameter k* := [T/2\ . After the fc*-th iteration of Algorithm^ it follows from Theorem [7] (Inequal- 
ity ©) that 



E 



(fc'-i) 



r (fe* 



< a 



X ^ ^ X LS 



2 \\*>n(Ay -z (fe * 1} 



|A||J 



Indeed, the randomized Kaczmarz algorithm is executed with input (A, b — z^ k ' ^) and current estimate 
vector x( fc Sety = b-^A) and w = b^^x — z^ k ~ x > in Theorem[7]and recall that x LS = A^b = A^b7j( A ) = 
Aty. 

Now, averaging the above inequality over the random variables i±, j±, i 2 , 32, ■ ■ ■ , ife*-i, jk*-i an d using 
linearity of expectation, it holds that 



f (fc*) 



X ^ ^ X LS 



2 E\\b n(A) ± -z^-^\ 



2 

2 lib 



(8) 



K(A) I 



< ... <a 



x(°) - x, 



2 

F 

fe*-2 

E c 

Z=0 



by Ineq. 



l ll P ^(A)|| 2 

l|A||? 



, (repeat the above k* — 1 times) 



< x 



I ll b ^(A)|l2 



E 



, since a < 1 and x' -* = 0. 



Simplifying the right hand side using the fact that a ' = T^a = k f (A)/ it follows 



2 11 1 1 2 2 

< II X ls|| 2 + || b K(A)|| 2 /^min- 



(9) 



Moreover, observe that for every I > 



>n(A) J 



< a 



l+k' 



5 ^(A) I 



>R(A) I 



Now for any k > 0, similar considerations as Ineq. © implies that 



(10) 



Ak+k* 



<aE 



x ( fc+fc *-i)_ XLs 



2 E||b w - z (fc-i+fc*) 



< . . . < a E 



< a fc E 



2 a 
2 



1 „(fc-i)- i E || b TC(A)-' 



E< 

i=0 



J TC(A)|| 2 



|2 fe-1 



^ a 1 (by Ineq. ©) 



< ot ( I j x LS 1 1 2 + ||b TC ( A )|| 2 /CT^iJ +a 



1=0 

fe* ll b . . II 2 /. 



(by induction) 



/a 2 min (by Ineq.©) 



|x LS || 2 + (a fe + a fe *)||b TC(A) || 2 /a 2 



< a k ||x LS ||2 + {a k +a k ") K 2 (A) \\y: LS \\t, since ||b TC(A) || < a max ||x LS || 2 



<a k '{l + 2n 2 (A))||x LS ||J 



To derive the last inequality, consider two cases. If T is even, set k = k* , otherwise set k = k* + 1. In both 
cases, (a k + a k " ) < 2a fc * . □ 
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4.3 Theoretical bounds on time complexity 

In this section, we discuss the running time complexity of the randomized extended Kaczmarz (Algo- 
rithm [3). Recall that REK is a Las- Vegas randomized algorithm, i.e., the algorithm always outputs an 
"approximately correct" least squares estimate (satisfying ©) but its runnning time is a random variable. 
Given any fixed accuracy parameter e > and any fixed failure probability < 8 < 1 we bound the number 
of iterations required by the algorithm to terminate with probability at least 1 — 8. 

Lemma 9. Fix an accuracy parameter < e < 2 and failure probability < 8 < 1. In exact arithmetic, Algorithm^ 
terminates after at most 

iterations with probability at least 1 — 8. 



8e 2 



Proof. Denote a := 1 — 1 / n 2, (A) for notational convenience. It suffices to prove that with probability at least 
1 — 5 the conditions of Step[5]of Algorithm[3]are met. Instead of proving this, we will show that: 

1. With probability at least 1 - 8/2: ||(b - z( T *)) - b K(A )|| 2 < £ ||b K(A ) || 2 /4. 

2. With probability at least 1 - 8/2: ||x (T *) - x LS || 2 < e ||x LS || 2 /4. 

Later we prove that Items (1) and (2) imply the Lemma. First we prove Item (1). By the definition of the 
algorithm, 



(!< b 



J TC(A) 



^ e ll b K(A)|| 2 /4 



'71(A)- 1 

16E||z( T *) 



J 7e(A) J 



> £ b 



TC(A)|| 2 



/16 



< 

e 2 ||b TC( A)|| 2 
< 16a T */e 2 < 5/2 

the first equality follows since b — b^(A) = i>Tz{A) ± > the second inequality is Markov's inequality, the third 
inequality follows by Theorem[2l and the last inequality since T* > k 2 (A) ln(^-). 
Now, we prove Item (2): 



< £ ||x LS || 2 /4 < 



16E||x( T *> -x„ 



x, 



< 16«L^/2J (1 + 2k 2 (A))/£ 2 < J/2 _ 



the first inequality is Markov's inequality, the second inequality follows by Theorem[8j and the last inequal- 
ity follows provided that T* > 2k f 2 (A) In ( 32(1+ // (A)) ) 

A union bound on the complement of the above two events (Item (1) and (2)) implies that both events 
happen with probability at least 1 — 8. Now we show that conditioning on Items (1) and (2), it follows that 
REK terminates after T* iterations, i.e., 



Ax< T *) -(b-z< T *)) 



<e||A|| 



AT*] 



and 



|A T z«| 



< e. 



We start with the first condition. First, using triangle inequality and Item 2, it follows that 



> llx 



X LS 



> (l-e/^IMs 



(11) 
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Now, 



Ax< T *) - (b 



«(r*)> 



< 



< 



Ax( T *)-b 



A(x 



(T* 



TZ(A) 
X LS ) 



(b-Z< T *>) 



j tc(a) 



} K(A)|| 2 



/4 



x( T *)-x, 



e||Ax LS || 2 /4 



< ecr raax ||x 

f/g 
l-e/4 



< 



LS 112 / 2 

X (T*) 



< e 



where the first inequality is triangle inequality, the second inequality follows by Item 1 and b^^) = Ax LS , 
the third and forth inequality follows by Item 2 and the fifth inequality holds by Inequality (TTTb and the last 
inequality follows since e < 2. The second condition follows since 



A T z< T * 



< 



AT ( b TC(A) J 

e/4 



J 7e(A) J 



< SCTmax |PTC(A) || 2 



/4<e<axl|x L s|| 2 /4 



l-e/4 



r (T*) 



<£||A|| 



r (T*) 



the first equation follows by orthogonality, the second inequality assuming Item (2), the third inequality 
follows since b^/n = Ax LS , the forth inequality follows by tfTll and the final inequality since e < 2. □ 

Lemma|9]bounds the number of iterations with probability at least 1 — 5, next we bound the total number 
of arithmetic operations in worst case (Eqn. Jl2l l) and in expectation (Eqn. Jl3)l ). Let's calculate the compu- 
tational cost of REK in terms of floating-point operations (flops) per iteration. For the sake of simplicity, 
we ignore the additional (negligible) computational overhead required to perform the sampling operations 
(see Section|5]for more details) and checking for convergence. 

Each iteration of Algorithm [3] requires four level-1 BLAS operations (two DDOT operations of size m 
and n, respectively, and two DAXPY operations of size n and m, respectively) and additional four flops. In 
total, 4(m + n) + 2 flops per iteration. 

Therefore by Lemma |9j with probability at least 1 — 8, REK requires at most 



5(m + n)-T* < 10(m + n)rank (A) k 2 (A) In 



32(1 + 2k 2 (A)) 
IT 2 



(12) 



arithmetic operations (using that k 2 (A) < rank (A) k 2 (A)). 

Next, we bound the expected running time of REK for achieving the above guarantees for any fixed e and 
S. Obviously, the expected running time is at most the quantity in (12)| . However, as we will see shortly the 
expected running time is proportional to nnz (A) instead of (m + n)rank (A). 

Exploiting the (possible) sparsity of A, we first show that each iteration of Algorithm [3] requires at most 
5(C av g + Ravg) operations in expectation. For simplicity of presentation, we assume that we have stored A 
in compressed column sparse format and compressed row sparse format |BBC + 87| . 

Indeed, fix any i k £ [to] and jk € [n] at some iteration fc of Algorithm [3] Since A is both stored in 
compressed column and compressed sparse format, Steps 7 and Step 8 can be implemented in 5nnz (A( J( .)) 
and 5nnz (A^ k '), respectively. 

By the linearity of expectation and the definitions of C avg and R avg , the expected running time after T* 
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iterations is at most 5T*(C avg + R avg ). It holds that (recall thatpj = ||A (j) \\l / \\A\\l) 



C a v g T^^^||A, ) ||Jnn Z (A, ) )ji^ln( 





using the definition of C avg and T* in the first equality and the fact that max,^^] ||A(j)|| < cr^ ax and 
Y^j=i nnz (A(j)) = nnz(A) in the first and second inequality. A similar argument shows that Ravg?"* < 




Hence by Lemma[9j with probability at least 1 — 8, the expected number of arithmetic operations of REK 
is at most 



In other words, the expected running time analysis is much tighter than the worst case displayed in Equa- 
tion dl2] | and is proportional to nnz (A) times the square condition number of A as advertised in the abstract. 

5 Implementation and Experimental Results 

5.1 Implementation 

The proposed algorithm has been entirely implemented in C. We provide three implementation of Algo- 
rithm |3 REK-C, REK-BLAS and REK-BLAS-PRECOND. REK-C corresponds to a direct translation of Al- 
gorithm|3]to C code. REK-BLAS is an implementation of REK with two additional technical features. First, 
REK-BLAS uses level-1 BLAS routines for all operations of Algorithm [3] and secondly REK-BLAS addition- 
ally stores explicitly the transpose of A for more efficiently memory access of both the rows and columns of 
A using BLAS. REK-BLAS-PRECOND is an implementation of REK-BLAS that additionally supports upper 
triangular preconditioning; we used Blendenpik's preconditioning code to ensure a fair comparison with 
Blendenpik (see Section l5^2t . In the implementations of REK-C and REK-BLAS we check for convergence 
every 8 min(m, n) iterations. 

Moreover, all implementations include efficient code that handles sparse input matrices using the com- 
pressed column (and row) sparse matrix format |BBC + 87| . 

Sampling from non-uniform distributions The sampling operations of Algorithm [3] (Steps 4 and 5) are 
implemented using the so-called "alias method" for generating samples from any given discrete distribu- 
tion HWal77llVos9TH . The alias method, assuming access to a uniform random variable on [0, 1] in constant 
time and linear time preprocessing, generates one sample of the given distribution in constant time IIVos91ll . 
We use an implementation of W. D. Smith that is described in ISmi021 and C's drand48() to get uniform sam- 
ples from [0, 1]. 

5.2 Experimental Results 



We report our experimental results in this section. We compared the randomized extended Kaczmarz 
(REK-C, REK-BLAS,REK-BLAS-PRECOND) algorithm to LAPACK's DGELSY and DGELSD least squares 




(13) 
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Random sparse overdetermined least squares 

9r 




°2 4 6 8 10 12 14 16 18 20 

Number of rows (x 1000) 



Random sparse underdetermined least squares 

18r 




(a) Random sparse matrices having 800 columns and density 0.25. (b) Random sparse matrices having 800 rows and density 0.25. 



Figure 1: Figures depict the running time (in seconds) vs increasing number of rows /columns (scaled by 
1000) for the case of random sparse overdetermined (Figure 1(a)} and underdetermined (Figure [l(b)| least 
squares problems. 

solvers, Blendenpicl<0 (version 1.3, IAMT10I ) and MATLAB's backslash operator. LSRN IMSM11I did not 
perform well under a setup in which no parallelization is allowed as the one used here, so we do not include 
LSRN's performance. DGELSY uses QR factorization with pivoting and DGELSD uses the singular value 
decomposition. We use MATLAB with version 7.9.0.529 (R2009b). In addition, we use MATLAB's included 
BLAS and LAPACK packages and we call LAPACK's functions from MATLAB using MATLAB's CMEX 
technology which allows us to measure only LAPACK's elapsed time. We should highlight that MATLAB 
is used as a scripting language and no MATLAB-related overheads have been taken under consideration. 
Blendenpik requires the FFTW librarjH; we used FFTW-3.3.3. To match the accuracy of LAPACK's direct 
solvers, we fixed e in Algorithm [3] to be 10e-14. Moreover, during our experiments we ensured that the 
residual error of all the competing algorithms were about of the same order of magnitude. 

We used a Pentium(R) Dual-Core E5300 (2.60GHz) equipped with 5GB of RAM and compiled our source 
code using GCC-4.7.2 under Linux operating system. All running times displayed below are measured 
using the ftime Linux system call by taking the average of the running time of 10 independent executions. 

We experimented our algorithm under three different distributions of random input matrices (sparse, 
dense and ill-conditioned) under the setting of strongly rectangular settings of least squares instances. In 
all cases we normalized the column norms of the input matrices to unity and generate the right hand side 
vector b having Gaussian entries of variance one. 

Sparse least squares We tested our algorithm in the overdetermined setting of random sparse m x n 
matrices with n = 800 and m = 2000, 3000, . . . , 20000 and density 0.25. We also tested REK-BLAS on the 
underdetermined case where m = 800 and n = 2000, 3000, . . . , 20000. In both cases, the density of the 
sparse matrices was set to 0.25 (for even sparser matrices REK-BLAS performed even better compared to 
all other mentioned methods). To generate these sparse matrix ensembles, we used MATLAB's sprandn 
function with variance one. The results are depicted in Figured] Both plots demonstrate that REK-BLAS is 
superior on both the underdetermined (Figure [l(b)} and overdetermined case (Figure 1(a)} . It is interesting 
that REK-BLAS performs well in the underdetermined case. 

Dense and well-conditioned least squares In this scenario, we used random overdetermined dense mxn 
matrices with much more rows than columns, i.e., we set n = 500 and m = 1000, 2000, . . . , 20000. We 



2 Available at http://www.mathworks.com/matlabcentral/fileexchange/25241-blendenpik. Blendenpik's default settings were 
used. 

3 http : / / www. f f tw. org / 
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Random den: 



overdete: 



d least squa: 



-O-REK-C 

REK-ELAS 

-*-DGELSD 
DGELSY 
BLENDENP I K 




Random dense under determined least squ; 



Number of 



(a) Random dense matrices having 500 columns. 




(b) Random dense matrices having 500 rows. 



Figure 2: Figures depict the running time (in seconds) vs increasing number of rows /columns (scaled by 
1000) for the case of random dense overdetermined (Figure 2(a) I and underdetermined (Figure [2 (b)\ least 
squares problems. 



also tested REK-BLAS on the underdetermined case where m = 500 and n = 1000, 2000, . . . , 20000. We 
generated this set of matrices using MATLAB's randn function with variance ten. We depicted the results 
in Figure 2(a) In the overdetermined case (Figure |2(a)fr , REK-BLAS is marginally superior compared to 



LAPACK's routines whereas REK-C (as a naive implementation of Algorithm [3j> is inferior. Blendepik is 
the winner in this case. Interestingly REK-BLAS almost matches the performance of Blendenpik in the 
underdetermined case, see Figure [2(b)] 



Dense and ill-conditioned least squares Finally we tested all algorithms under a particular case of 
random ill-conditioned dense matrices with n = 500 and m = 1000, 2000, . . . , 20000. Namely we used 
Higham's randSVD function for generating these matrices [Hig89. Hig96|. More precisely we set the condi- 
tion number of these matrices to be 10e6; set the top singular value to one and the rest to 10e-6. The results 
are displayed in Figure P5] Unfortunately in the ill-conditioned setting REK-BLAS-PRECOND is inferior 
compared to LAPACK's routines and Blendepik. We also verified the results of I1AMT101 that Blendepik is 
superior compared to LAPACK's least squares solvers in this setting. 
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7 Appendix 



We present the proof of known facts from previous works for completeness. 

Proof, (of Lemma|5]l It suffices to show that (x' fe+1 ' — x LS , x' fe+1 ' — x' fe ') = 0. For notational convenience, 



let a, := — — jn^yjp — f° r ever y i £ [ m \- Assume that x' fe+1 ' = x' fc ' + ai k A^ k ' ) for some arbitrary ik G [m]. 
Then, 



^ -x LS , x (*+i) _ X W^ = (x^ 1 ' -x LS , a ifc A(^)) = 



(x^ +1 ), A^))-6 ifc ) 



using the definition of x( fc+1 ), and the fact that (x LS , A^*)) = b ik since x LS is a solution to Ax = b. Now, 
by the definition of a ik , (x( fe+1 \ A^) = (xW, A^)) + a ik \\A^ \\l = (xW, A^) + b lk - (x( fe \ A^) = 



Proof, (of LemmaO In light of Lemma[5j it suffices to show that Ez ||x' fc+1 ^ — \\ 2 > t^a) || x ^ fc ' — x l: 
By the definition of x( fc+1 ), it follows 





2 




x (fc+l) _ x w 




( 




2 





'bz-(xW A( z ))^ 



A (Z) 



|A(x LS -x«) 



(x LS -xW A^)) S 
llA^)||" 



By hypothesis, x' fe ' is in the row space of A for any k when x(°) is; in addition, the same is true for x LS by 
the definition of pseudo-inverse ICL961 . Therefore, ||A(x LS - x^'>)|| 2 > cr min || x LS — x' ' . □ 

Proof, (of Theorem^ As in INeelOL for any i £ [to] define the affine hyper-planes: 

Hi :={x:(AW, x) = Vi } 

UT :={x:(aW,x)= W + ^} 

Assume for now that at the /c-th iteration of the randomized Kaczmarz algorithm applied on (A, b), the 
i-th row is selected. Note that x' fe ' is the projection of i6 k ~ x ^> on H % f i by the definition of the randomized 
Kaczmarz algorithm on input (A, b). Let us denote the projection of x^'" 1 ) on Hi by x^ fe '. The two affine 
hyper-planes Hi, Hf* are parallel with common normal A'*), so is the projection of x( fe ) on Hi and the 
minimum distance between Hi and Hf i equals \wi\/ \\A^\\ 2 . In addition, x* e Hi since (x*, A^) = yt, 
therefore by orthogonality we get that 



x (fc) _ x * 
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x« - X* 
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x w_ x w 




2 
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(14) 



Since xW is the projection of x^ -1 -* onto Hi (that is to say, xW is a randomized Kaczmarz step applied on 
input (A, y) where the i-th row is selected on the fc-th iteration) and x^' -1 ) is in the row space of A, Lemma[6] 
tells us that 

1 



E 



x<*> - x* 



< 1- 



«?(A) 



X (*-D_ x * 



(15) 



Note that for given selected row i we have ||x^ fe ^ — x( fc ) \ \t = 



rows of A we have that 



l A<i) ll 



-; by the distribution of selecting the 



#-xW 
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A' 



(16) 
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Inequality ((5) follows by taking expectation on both sides of Equation l fl4t and bounding its resulting right 
hand side using Equations l fT5t and (16). Applying Inequality (01 inductively, it follows that 



< 1 



f(A) 



,(o) 



,2 k 



«?(A) 



where we used that is in the row space of A. The latter sum is bounded above by YliLo ( ^ 



II A||? /a^ 



?(A) 



□ 
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